Genomic Region Detection via Spatial Convex Clustering
Authors
Abstract
Several modern genomic technologies, such as DNA methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead, scientists seek to discover biologically interpretable genomic regions composed of contiguous groups of probes, which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters. Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC's advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenetic-wide association.
arXiv:1611.04696v1 [stat.AP] 15 Nov 2016
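To make the idea of combining a fusion penalty with convex clustering over spatially ordered probes concrete, here is a minimal sketch. It is not the SpaCC formulation itself, which the abstract does not spell out; it assumes a generic objective with a squared-error fit term plus a group fusion penalty on adjacent probes only. The function names, the `gamma` and `weights` parameters, and the tolerance `tol` are all hypothetical and chosen for illustration.

```python
import numpy as np


def spatial_fusion_objective(X, U, gamma, weights=None):
    """Evaluate an assumed spatial convex clustering objective.

    X, U: arrays of shape (n_subjects, n_probes); columns are probes in
    genomic order. The penalty couples only neighbouring probes, so fused
    columns form contiguous regions shared by all subjects.
    """
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    # Column-wise differences between each probe and its right-hand neighbour.
    diffs = U[:, 1:] - U[:, :-1]
    norms = np.linalg.norm(diffs, axis=0)
    if weights is None:
        weights = np.ones_like(norms)  # hypothetical default: uniform weights
    fit = 0.5 * np.sum((X - U) ** 2)
    return fit + gamma * np.sum(weights * norms)


def regions_from_centroids(U, tol=1e-8):
    """Label probes by region: a new region starts wherever adjacent
    centroid columns differ by more than tol."""
    U = np.asarray(U, dtype=float)
    norms = np.linalg.norm(U[:, 1:] - U[:, :-1], axis=0)
    return np.concatenate(([0], np.cumsum(norms > tol)))
```

Given a solution `U` from any convex solver, `regions_from_centroids(U)` returns one integer label per probe, and probes sharing a label form one contiguous candidate region.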
Similar resources
Modified Convex Data Clustering Algorithm Based on Alternating Direction Method of Multipliers
Because the main weakness of most standard methods, including k-means and hierarchical clustering, is their sensitivity to initialization and their tendency to become trapped in local minima, this paper proposes a modification of convex data clustering in which there is no need to be particular about how initial values are selected. Due to properly converting the task of optimization to an equivalent...
Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the rapid growth of spatial trajectory data, together with the need to process it and extract useful information and meaningful patterns, has attracted many researchers to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
Visual Clustering and Exploration of Splicing Sites using DNA Curvature Criteria
The aim of this paper is to explore the clustering capabilities of our visual genome explorer software. Genome3DExplorer is a new modeling and software solution to explore textual and factual genomic data. It offers a powerful and user-centered visualization of this information within an immersive environment. The visualization is based on a graphical paradigm that automatically helps to build ...
Spatial smoothing and hot spot detection for CGH data using the fused lasso.
We apply the "fused lasso" regression method of (TSRZ2004) to the problem of "hot-spot detection", in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the n...
Spatial Clustering of Multivariate Genomic and Epigenomic Information
The combination of fully sequenced genomes and new technologies for high-density arrays and ultra-rapid sequencing enables the mapping of gene-regulatory and epigenetic marks on a global scale. This new experimental methodology was recently applied to map multiple histone marks and genomic factors, characterizing patterns of genome organization and discovering interactions among processes of epi...